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Abstract 

Background: Cancer subtype information is critically important for understanding tumor heterogeneity. Existing 
methods to identify cancer subtypes have primarily focused on utilizing generic clustering algorithms (such as 
hierarchical clustering) to identify subtypes based on gene expression data. The network-level interaction among 
genes, which is key to understanding the molecular perturbations in cancer, has been rarely considered during the 
clustering process. The motivation of our work is to develop a method that effectively incorporates molecular 
interaction networks into the clustering process to improve cancer subtype identification. 

Results: We have developed a new clustering algorithm for cancer subtype identification, called "network-assisted 
co-clustering for the identification of cancer subtypes" (NCIS). NCIS combines gene network information to 
simultaneously group samples and genes into biologically meaningful clusters. Prior to clustering, we assign weights 
to genes based on their impact in the network. Then a new weighted co-clustering algorithm based on a 
semi-nonnegative matrix tri-factorization is applied. We evaluated the effectiveness of NCIS on simulated datasets as well 
as large-scale Breast Cancer and Glioblastoma Multiforme patient samples from The Cancer Genome Atlas (TCGA) project. 
NCIS was shown to better separate the patient samples into clinically distinct subtypes and achieve higher accuracy on 
the simulated datasets to tolerate noise, as compared to consensus hierarchical clustering. 

Conclusions: The weighted co-clustering approach in NCIS provides a unique solution to incorporate gene network 
information into the clustering process. Our tool will be useful to comprehensively identify cancer subtypes that would 
otherwise be obscured by cancer heterogeneity, using high-throughput and high-dimensional gene expression data. 
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Background 

For a given type of cancer, there are often subtypes that 
harbor unique sets of genomic changes and exhibit dif- 
ferent patterns of gene expression [1-5]. Subtype infor- 
mation is critically important to tailor more effective 
treatments for patients, as varying subtypes often re- 
spond disparately to the same treatment. In the past dec- 
ade, many generic clustering-based approaches have been 
developed to identify cancer subtypes based on gene 
expression data. Typically, expression levels of d genes 
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measured on n samples are presented as a dxn real- 
valued matrix with the entries representing the corre- 
sponding expression level. A clustering method can be 
applied to partition the columns/rows of this matrix into 
different clusters such that items in one cluster have 
similar expression patterns. The partition of columns of- 
fers clues to potential cancer subtypes, while the partition 
of rows can highlight potentially relevant co-expressed 
genes. The most popular clustering methods used in can- 
cer subtype identification include hierarchical clustering 
and /c-means [6,7]. Recently, a number of other clustering 
methods have also been developed. Consensus clustering 
[8] is a clustering framework where the same cluster- 
ing algorithm is applied to different subsets of the data 
multiple times. A consensus result is then collected to 
better describe the similarities between samples. This 
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framework has been widely used in cancer subtype ana- 
lysis [9]. To address the high-dimensional feature space 
problem (which is almost always the case for gene ex- 
pression analysis), the method developed in [10] uses 
sparse clustering techniques to adaptively select a small 
set of informative features to cluster the samples. There 
are also several clustering methods that were specifically 
designed for cancer subtype clustering. In [11,12], 
survival time information was used to select survival- 
associated genes and then the samples were clustered 
using gene expression. In [13,14], an integrated approach 
was developed to consider multiple types of omics data 
(e.g. gene expression, mutation, copy number, methyla- 
tion) to help identify cancer subtypes. However, we are 
more interested in only utilizing the gene expression data 
(which is much more accessible than any other types 
of omics data) and getting the subtype information 
of a patient (based on molecular features such as gene 
expression) without using clinical features (such as sur- 
vival time). More importantly, most previous studies did 
not incorporate biological information, particularly mo- 
lecular interactions networks, into the clustering step. 
Indeed, network is key to understanding the molecular per- 
turbations in cancer [15,16]. If we consider the intercon- 
nection between genes during the clustering process, we 
have more knowledge of gene interactions at a systems 
level and may improve our ability to identify cancer sub- 
types. This will in turn allow us to analyze the perturbation 
of a group of genes and pathways rather than individual 
ones to better understand tumor heterogeneity. 

The motivation of our work is to develop a method 
that effectively incorporates molecular interaction net- 
works into the clustering process to improve cancer 
subtype identification. To this end, one previous work 
employed information of biological networks during the 
clustering [17]. The method first defined a network dis- 
tance based on the proximity of two genes in the network 
and an expression distance of two genes, and then con- 
structed the overall distance metric as a function of net- 
work and expression distance metrics for hierarchical 
clustering. Another recent work incorporated network 
information to cluster genotypes and phenotypes based on 
pheno type-gene association matrix [18]. The authors did 
so by adding penalty and regularization terms into the 
clustering objective to keep the final results consistent 
with clusters obtained from prior knowledge on the dis- 
ease phenotype similarity network. However, these two 
approaches are not appropriate for our cancer subtype 
identification. First, as our goal is to cluster cancer 
patients (not genes as in [17]), we cannot add a network- 
based distance in the distance metric defined for patients. 
Second, network-derived clusters [18] are also difficult to 
define for patients since there is no network structure link- 
ing all the patients (like the phenotype similarity network). 



Finally, simply combining network proximity-defined gene 
clusters directly with gene expression clusters may be mis- 
leading, since neighboring genes can have entirely different 
expression patterns. 

In this work, we introduce a new co-clustering algo- 
rithm to effectively integrate network information with 
expression variation across samples. We call our method 
"network-assisted co-clustering for the identification of 
cancer subtypes" (NCIS). The method first learns a 
weight for each gene as an indicator of its importance to 
be used in the clustering. The key idea is that genes 
regulating many other genes and showing highly variable 
expression patterns will be considered as more inform- 
ative in the clustering process. Another important con- 
tribution of this work is that we embed the gene weights 
directly into the co-clustering objective function. 

Co-clustering simultaneously clusters both samples and 
features [19,20]. In co-clustering, similarity is a measure of 
the coherence of features (e.g. genes) and samples in a bi- 
cluster, rather than a function of feature pairs or sample 
pairs. Consequently, it considers the local context and is 
able to automatically select subsets that share similar attri- 
butes [21,22]. The method we utilize in NCIS is based 
on Semi-Nonnegative Matrix Tri-Factorization (SNMTF) 
[23,24], a member of the matrix factorization-based clus- 
tering family. A common underlying assumption of such 
co-clustering methods is that there exist cluster centroids 
that characterize the behavior and trend of cluster mem- 
bers, which is mathematically formed as matrix tri- 
factorization. Matrix factorization has simple formalization 
when compared to other methods, and was shown to be 
useful in gene expression analysis [25,26]. To our know- 
ledge, NCIS is the first method to apply SNMTF to achieve 
weighted co-clustering in cancer subtype identification. 

Methods 

Method overview 

We developed a clustering method that incorporates the 
gene network (i.e. the interactions between genes) as prior 
knowledge and simultaneously cluster samples and genes 
into distinct groups. Adding network structure to the clus- 
tering step will help us better select representative genes 
for clustering. We expect that such a method will generate 
more biologically informative clusters. The main scheme of 
our method is shown in Figure 1. Note that we assume the 
input expression data has already been pre-processed such 
that the batch effects and technical artifacts are already 
removed. We implemented the algorithm in MATLAB. 
Source code and users' manual can be found at http:// 
bioen-compbio.bioen.illinois.edu/NCIS/. 

Assigning weights to genes 

Feature selection is essential for successful pattern rec- 
ognition from the high-dimensional data. In many previous 
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studies, genes were selected based on their median abso- 
lute deviation (MAD) or coefficient of variation (CV) 
[9,27]. The cutoff was set rather arbitrarily and typically 
only a small subset of genes was retained for subsequent 
analysis, which drastically reduces the amount of in- 
formation used in clustering. Other dimension-reduction 
methods such as principal component analysis (PCA) 
[28] are useful, but the biological interpretation is not al- 
ways straightforward as expression vectors of the samples 
are projected to a low-dimensional principal component 
space [29,30]. On the other hand, incorporating add- 
itional biologically relevant information as prior know- 
ledge could help resolve ambiguities in the data because 
it provides, to a certain extent, insight into how the gene 
expression profiles were generated. Therefore, we utilize 
the gene network as well as expression information to 
select genes that both play more important roles in the 
network and show larger variations among samples. 
Our method assigns a weight to each gene; genes with 
higher weights will be prioritized during the weighted co- 
clustering. 

We use a modified PageRank algorithm to assign 
weights to genes. The original PageRank [31] views the 
web as a directed graph. Suppose there are N nodes 
(web pages), then E is 2i NxN matrix denoting the con- 
nections among the nodes. A link from page / to page ; 
is shown by an edge pointing from node / to node and 
in the matrix form denoted by Eij = 1. More links to 
node ; raises its confidence level. The algorithm ranks all 
the nodes based on the iterative formula: 

^ E -r^~^ 
rf = {l-a)+ay^^,l<j<N, 

where is the confidence level in the n^^ iteration and 
deg^ = YljLi^ij > referring to the total number of web 
pages that / points to; a is a parameter representing the 
extent to which the ranking depends on the structure 



of the graph. In our case, we have a similar network 
representing the molecular interactions among genes. 
Our method that assigns weights to genes is similar to a 
gene ranking method, GeneRank [32], which extended 
the idea of PageRank to microarray gene differential ex- 
pression analysis. However, we use a directed graph (ra- 
ther than undirected graph in GeneRank), because we 
believe the direction of edges that models gene regula- 
tion is important. Additionally, rather than using differ- 
ential expression, we use an alternative method to 
consider gene expression variation across samples as de- 
scribed below. 

In our graph, a directed edge from node / to node ; 
means that gene / regulates the expression of gene 
Genes with larger variations among samples tend to have 
more distinguishing power. We incorporate such variation 
into the model to assign weights to genes. Our main idea is 
that genes having a lot of heavy-weight downstream targets 
should be assigned large weights - a rationale similar to 
the confidence vote in the original PageRank, except that 
outgoing edges increase a genes weight while incoming 
edges increase a web pages weight. Our weight-training 
approach is: 

^ E w^~^ 

< = (l-a)NMADj + aV^ ^\ ' ,\<j<N, 

EN 
._^Eji is the total number of genes 

that regulate gene /; w'^ is the weight vector of genes in 

the n^^ iteration and NMAD is the normalized median 

absolute deviation (MAD): 

NMADi = — , 

max(M4D) ' 

where max(AlA£)) is the maximum of vector MAD, 

We use MAD as a measurement of the expression 
variation of a gene among all the samples. The values 
are normalized by the maximum value in all MAD s to 
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make the weight-training mechanism stable and compar- 
able with different overall expression levels. In each iter- 
ation, every gene / is evaluated by its own MAD as well 
as the weights and connections of the genes that / regu- 
lates. The final weight of each gene reflects both its im- 
pact in the network and its ability to separate the 
samples. 

The convergence of this iterative algorithm is guaran- 
teed for any 0 < a < 1 [32,33]. Let w""^^ = w"", we have 



Table 1 Main notations used in this paper 



wj = {l-a)NMADj^a 



1<J<N. 



We can write in the matrix form as 

w"" = {l-a)NMAD + aED-^w"", 

where D is a diagonal matrix with entries deg/, l<i<N, 
The final weight for all the genes can be represented as: 



(l-a) X (l-aED 



X NMAD, 



where / is the NxN identity matrix. 

To make weights trained with different parameters 
more comparable, we normalized w such that the 
maximum of is 1. We chose a relatively large a value 
{a = 0.85) to make the weights rely more on the network 
structure. 

Weighted co-clustering algorithm 

After assigning weights to genes, our input data include 
the gene expression profile of each sample and the 
weights for all the genes from the previous step. We de- 
veloped a new weighted co-clustering method to simul- 
taneously separate samples into subtypes and group 
genes into functionally relevant subclasses. Our method 
is based on Semi-Nonnegative Matrix Tri- Factorization 
(SNMTF), where the nonnegative constraint on the data 
matrix imposed on Orthogonal Nonnegative Matrix Tri- 
Factorization (ONMTF) is relaxed to make it suitable for 
general dataset. 

Objective 

Suppose our gene expression matrix X contains d genes 
and n samples, and we would like to group the genes 
into m clusters and group the samples into c clusters 
(subtypes). For convenience, the main notations used in 
the rest of the paper are presented in Table 1. Our method 
can be specified as minimizing the following objective. 



WX-GSF^Wl- 



d 



(GSF'')^fxWu 



■ tr(X^ WX-2X^ WGSF^ + FS^ WGSF^ 



Here, G denotes the cluster each gene belongs to and F 
denotes the cluster of every sample. Entries of matrix S can 



Notation 



Description 



Number of samples 

Number of genes 

Number of sample clusters 

Number of gene clusters 

Gene expression matrix of size dxn 

The column of X, representing the expression 
of the sample 



n 
d 
c 
m 
X 



S 
W 



The / row of X, representing the expression of 
the /^^ gene 

Sample partition matrix of size n x c; Fij e[0,l]: 
Fij= 1 if X, belongs to sample cluster j and 
Fij = 0 otherwise 

Feature partition matrix of size d x m; Gjj e[0,l]: 
Gij= 1 \f Xj. belongs to gene cluster j and Gjj = 0 
otherwise 

A mxc matrix 

A dxd diagonal matrix; entries are the weights 
of the genes 



be treated as centroids of the blocks generated. The afore- 
mentioned weights are presented in the diagonal matrix 
W, and we incorporate an "importance indicator" by multi- 
plying the weights to the row (gene) norms. This is to 
prioritize genes with large weights in the optimization step. 
Due to difficulties in minimizing the objective with the 
binary-value constraint on F and G, we relax F and G into 
continuous nonnegative domain as in previous related 

m c 

work [24]. We only require ^G^y = 1, y^F^y = 1. Thus 
our objective is to minimize: 

/ = tr{X^WX-2X^WGSF^ +FS^G^WGSF^), 

m c 

s.t.G>0,F>0,^G^ = l,^F^ = l. 



Optimization 

We set: 



0, 



dS 

Then we have: 

S= {G^WGy^G^WXF{F^Fy\ 

We can get a clearer understanding of S from this 
expression. If G and Fare defined as in Table 1, i.e., 0/1- 
valued partition matrix, F^ F should be a c x c diagonal 
matrix, whose entries represent the number of samples 
belonging to each sample cluster. G^WG should be a 
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mxm diagonal matrix, with entries equal to the total 
weights of the features (genes) belonging to each of the 
m feature clusters. Similar to the interpretation of F, 
G^WG can be considered as the weighted total number 
of features in each feature cluster (taking feature / as Wi 
features when counting the total number). Therefore, 
{G^WG)'^ G^WX represents the feature cluster centroids 
on the sample space (f2-dimension) and XF(F^ F)'^ repre- 
sents the sample cluster centroids on the feature space 
(^/-dimension). The difference is that all the samples are 
assumed to have the same weight of 1, while features 
are assigned different weights W, Entries of S can be 
viewed as feature cluster centroids on the sample-centroids 
space (c-dimension) or as sample cluster centroids on the 
gene-centroids space (m-dimension). Therefore, S gives the 
centroids information of the bi-clusters after partitioning. 

Now, assume S and G are fixed. Let ^gR'^''^ be the 
Lagrangian multiplier for F, then Lagrangian function 
for F is 

L{F)=J-tr{/^F^). 

We set: 



dL{F) 
dF 



0, 



Using Karush-Kuhn-Tucker condition [34], we have 

+A- +FB^-FB-)..Fij = 0, 

where A = X^WGS, B = S^G^WGS; and M are the 
positive and negative of matrix M defined as = MhhM ^ 
M~ = ^ respectively. Therefore, we obtain the iterat- 



ing formula for F: 



KA-+FB^ 



Similar derivation leads to the iterative formula of G: 



Gir 



(C+ + WGD-)^. 
(C- + W^GD+)..' 



-SF^FS^, 



where C = WXFS^, D -- 

The iterations decrease the value of the objective func- 
tion /. Convergence of the algorithm can be shown using 
a typical approach for the convergence proof of NMF- 
based methods. For more details, see the proof in the 
(Additional file 1: Supplementary Materials). 
Our algorithm is as follows: 

• Initialize F and G. 

• While not convergent and iterations less than a 
pre-defined value 



Update S by 

S = {G^WG)~ ^G^WXF{F^F)- \ 
Update F by 

{A++FB-).. 



Update G by 



{C+ + WGD- 



m and c selection 

A question raised in almost all clustering methods is 
how to determine the cluster numbers. There is no 
agreed-upon solution. Here we utilize an approach that 
takes advantage of the stochastic property of our algo- 
rithm: although NCIS may not converge to the same so- 
lution on each run with different initiations, we could 
expect the results to be very stable if the clustering is 
strong enough [8,35]. As in [8,35], we ran NCIS 50 times 
with randomly selected initiations and defined a sample 
consensus matrix Ms and a gene consensus matrix Mg, For 
each run, 2i nx n sample connectivity matrix Mg and a 
d X d gene connectivity matrix Mg are obtained: 



Ms{ij) : 



Mg{ij) = 



1, if Sample i and Sample j belong to the same cluster 
0, otherwise ' 

1, if Gene i and Gene j belong to the same cluster 
0, otherwise 



Consensus matrices M^ and Mg are the averages of 
Af^s and M^s over 50 runs respectively. The entries 
range between 0 and 1, where 0 indicates that the corre- 
sponding samples (or genes) belong to different clusters 
in every run and 1 indicates that they belong to the same 
clusters in all the cases. Therefore, \-M offers a new 
distance metric between the items ( for samples 
and ^-Mg for genes). Similar to [35], we used I-M5 and 
^-Mg to hierarchically cluster samples and genes, and 
then we define an average cophenetic correlation coeffi- 
cient evaluate the stability. Cophenetic 
correlation coefficient /)(C) is defined as the Pearson cor- 
relation between distance matrix 1-C and the distance 
matrix induced by the linkage used in hierarchical clus- 
tering for re-ordering C. If a clustering is stable, the en- 
tries would be close to 0 and 1 (two modes), and in the 
ideal case (only 0 and 1) /)(C) would be exactly 1. We 
observe how the cophenetic correlation coefficients 
change as m and c change and select values where the 
averaged coefficient starts to decrease. 

Results and discussion 

We applied NCIS to two large-scale datasets from 
TCGA as well as simulated datasets to evaluate the ef- 
fectiveness of our method. We built the network using a 
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variety of sources, including the network used in [36] as 
well as our up-to-date curated information from Reac- 
tome [37], the NCI-Nature Curated PID [38], and KEGG 
[39]. The network from [36] consisted of inferred gene- 
interaction from sources of information such as protein 
interactions, gene co-expression, protein domain inter- 
action, and text-mined interaction described by [40]. To 
aggregate all of the networks together, all redundant 
edges were collapsed to single edges. We combined the 
edges of each of the databases such that a link between 
any two nodes A and B exists in the aggregated network 
if a link between A and B exists in any of the databases 
we used. The resulting aggregated network consisted of 
11,648 genes and 211,794 edges. Our method assumes 
that the network is an aggregation of different biological 
networks, such as protein-protein interaction network, 
transcriptional regulatory network, and signaling network 
etc. In the MATLAB implementation of our program, we 
also allow users to provide other network information 
as needed. 

Application to TCGA breast cancer dataset 

The first dataset we used is from a recent large-scale 
breast cancer study from TCGA [41]. This dataset con- 
tains the expression levels of 17,814 genes across 547 
samples. We first integrated the gene expression profile 
with the aggregated network information mentioned 
above, and trained weights for 8,726 genes included in 
both of these resources (for genes that are not included 
in either expression profile or network information, we 
ignored them). We set a = 0.85 (a is a tuning parameter 
that represents the extent to which gene weights rely on 
network structure; see Methods part). The 8,726 weighted 
genes and 547 samples were the input of NCIS. Figure 2 
shows the heatmap with genes and samples rearranged 
according to the NCISs clustering result. Based on the 
cophenetic correlation coefficient calculated from 50 runs 
(see Methods part), we chose number of patient clusters 
c = 5 and number of gene clusters m = 8 (Additional file 1: 
Table SI). 

Since we did not know the true class each sample be- 
longs to or the number of subtypes, we used clinical fea- 
tures to evaluate the effectiveness of the clustering 
algorithm. The underlying idea was that patients in dif- 
ferent subgroups were expected to have some different 
clinical characteristics. We used the following available 
clinical information to evaluate subtypes identification 
result: survival time, AJCC staging information (neo- 
plasm disease lymph node stage, neoplasm disease stage 
and tumor stage) and tumor nuclei percentage. AJCC 
neoplasm disease lymph node stage represents the stage 
of cancer based on the lymph nodes present. Neoplasm 
disease stage represents the extent of a cancer, especially 
whether the disease has spread from the original site to 
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Figure 2 NCIS result of the TCGA breast cancer expression data. 

Genes listed are the first 50 genes sliared between tlie ordered p-value 
list (based on ANOVA test of each gene's expression across the five 
subtypes) and the ordered gene weight list. 



other parts of the body. Tumor stage is a class assigned 
to a malignancy which allows for the grouping of similar 
cancer types based on the extent of disease in the pri- 
mary tumor, regional lymph nodes, and metastatic sites. 
Tumor nuclei percentage represents the percentage of 
tumor nuclei in a malignant neoplasm specimen (from 
TCGA data dictionary). Table 2 gives the significance 
level of the difference among all subtypes for each fea- 
ture. Given p-value threshold 0.05, we conclude that the 
NCIS (a = 0.85)-defined subtypes successfully separated 
the patients according to these clinical features. 

We also set a = 0 (no network information was used) 
in the co-clustering process to show the impact of net- 
work information. Similar statistical tests were performed 
(Table 2). In general, NCIS (a = 0.85) showed better 
p-values in separating the patients in terms of clinical fea- 
tures than NCIS {a = 0). 

In the original TCGA paper [41], the authors per- 
formed a hierarchical clustering using a subset of genes 
(most varied across samples) and identified 13 subtypes 
(test results for clinical features are shown in Table 2 as 
TCGA/BRCA paper). Since consensus hierarchical clus- 
tering generally performs better than the traditional 
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Table 2 P-value of the dependence test for different clinical features and breast cancer subtypes 



Method 


Survival 


Neoplasm disease lymph node stage 


Neoplasm disease stage 


Tumor stage 


Tumor nuclei percentage 


NCIS (a = 0.85) 


0.0444 


2.03x10"^ 


1.68x10"^ 


2.33x10"^ 


6.1 7 X 10"^ 


NCIS (a = 0) 


0.0442 


6.22x10"^ 


3.84x10"^ 


2.67 X 10"^ 


6.24 X 10"^ 


Consensus (k = 3) 


0.497 


0.123 


0.266 


0.175 


5.90x10"^ 


Consensus (k = 5) 


0.359 


3.29x10"^ 


2.08x10"^ 


0.187 


8.35x10"^ 


TCGA/BRCA paper 


0.831 


0.396 


0.337 


0.999 


0.0780 



For survival time, we used logrank test; for AJCC neoplasm disease lymph node stage, AJCC neoplasm disease stage, and AJCC tumor stage, we used Chi-squared 
test; for tumor nuclei percentage, we used ANOVA. Note that we did not use the normal-like subtype in this comparison. 



hierarchical clustering, we also applied a consensus average 
linkage hierarchical clustering [8,42]. To make a fair com- 
parison, we used all 8,726 genes. The program was run 
over 1,000 iterations and the resampling rate of the sample 
was set to 0.8. The distance metric was 1 minus Pearsons 
correlation coefficient. The algorithm suggested 3 subtypes. 
However, in Table 2, we listed the tests' p-values for both 
3-subtypes and 5-subtypes conditions to make it easier to 
compare with the results of NCIS. The results indicated 
that in general, clusters generated by consensus clustering 
were not as informative as those from NCIS. We think the 
most important reason is the lack of effective feature selec- 
tion in consensus clustering when there are a large number 
of genes as input. 

The advantage of NCIS is the incorporation of the net- 
work and assigning an "importance indicator" to each 
gene. Therefore, in addition to generating the subtypes, 
we also obtained a bi-product - the gene weights, which 
describe the genes' roles in the network and their abil- 
ities to distinguish the patient samples. We further per- 
formed ANOVA tests for each genes expression level 
across the five subtypes. In the heatmap in Figure 2, we 
showed the first 50 genes that are shared between the 
ordered p-value list and the ordered gene weight list 
(p-values are ordered from smallest to largest and gene 
weights are ordered from largest to smallest). To illustrate 
the difference for specific genes in the five subtypes at net- 
work level, we extracted the subnetwork of ABCC8 as an 
example (Figure 3). There are 9 genes targeted by ABCC8 
in the network we used. We chose this subnetwork be- 
cause it has a small size and is easily and clearly presented. 
Although we did not find literature studying the effect of 
ABCC8 in breast cancer, MRP has been reported to be 
highly associated with breast cancer tumor progression 
and treatment outcomes [43-45]. As shown in Figure 3, 
ABCC8 is highly expressed in Luminal A and Luminal B 
subtypes. Several of its downstream genes have very similar 
expression pattern. Such examples demonstrate the differ- 
ential expression pattern between subtypes at the network 
level and the advantage of prioritizing genes with higher 
impact in the network. 

The running time of NCIS {a = 0.85, c = 5, m = 8) on an 
8GB memory desktop for this dataset is about 5 minutes. 



Application to TCGA GBM dataset 

The second dataset we used was from a recent large- 
scale TCGA Glioblastoma Multiforme (GBM) subtype 
identification work [9]. This dataset contains the ex- 
pression of 11,861 genes on 200 GBM and 2 normal 
brain samples. In the original paper, the authors first 
selected 1,903 variably expressed genes according to 
the MAD and applied consensus hierarchical clustering 
with agglomerative average linkage [8]. Four subtypes were 
detected. 

We integrated the gene expression data with the net- 
work information to train a weight for each of the 7,183 
genes included in both sets (network and expression). 
Tuning parameter a was set to 0.85. After obtaining the 
weights, these 7,183 weighted-genes and the 202 samples 
were used in NCIS (result in Figure SI). We set m = 7 
and c = 4 (Additional file 1: Table S2). 

We again used clinical characteristics to evaluate the 
effectiveness of our method. We used survival time, 
tumor necrosis percentage, and tumor nuclei percentage. 
Tumor necrosis percentage represents the percentage of 
cell death in a malignant tumor specimen (from TCGA 
data dictionary). Additional file 1: Table S3 provides the 
significance level of the difference among all subtypes 
for each feature. We also ran consensus average linkage 
hierarchical clustering [8,42] on the 7,183-gene dataset. 
The program was run over 1,000 iterations and the re- 
sampling rate of the samples was set to 0.8. The distance 
metric is 1 minus Pearsons correlation coefficient. We 
identified 4 subtypes. Overall, NCIS {a = 0.85) performed 
the best. Interestingly, we observed that Subtype Pro- 
neural has a much higher survival rate than the other 
three subtypes (Additional file 1: Figure S2). The under- 
lying mechanism requires more study. In the heat- 
map in Figure SI, we also showed the first 50 genes 
that are shared between the ordered p-value list (based 
on ANOVA test of each gene s expression across the four 
subtypes) and the ordered gene weight list. Figure S3 
shows the subnetwork of CIQA, which is involved in 
immune response, to illustrate the difference among sub- 
types at network level. 

The running time of NCIS {a = 0.85, c = 4, m = 7) on 
an 8GB desktop for this dataset is about 2 minutes. 
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Figure 3 Expression patterns of ABCC8 subnetworic in breast cancer subtypes. Genes directly connected to ABCC8 and genes targeting 
ABCCS's downstream genes are included. Color of circle corresponds to gene expression level; size of circle corresponds to gene weight, (a) Subtype 
Luminal A; (b) Subtype Basal; (c) Subtype Luminal B; (d) Subtype HER2-enriched; (e) Subtype Normal-like. 



Evaluation by simulation designed a method to simulate gene expression data 

To further assess the performance of NCIS, we simu- based on network interaction structure (see Additional 
lated a dataset with 300 samples and 3 subgroups. We file 1: Supplementary Method). For the 3 subtypes we 
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simulated, the mean expression levels of each gene were 
estimated from the gene expression profiles of Luminal 
A, Luminal B, and Basal subtypes in the breast cancer 
dataset. The final datasets contained 300 samples and 
8,726 genes. 

To make the simulated datasets more realistic, noisy 
genes were added. We first trained a weight for each 
gene based only on the network structure and then chose / 
genes with lowest weights as "uninformative" genes. We 
randomly permutated the expression levels of these genes 
across the samples. / was set to 1000, 2000, 3000, 4000, 
and 5000 (we generated 5 datasets for each /). 

We set m = 8 and c = 3 in NCIS. The results for mul- 
tiple trials of the simulation studies were shown in the 
Figure 4. The running time of NCIS {a = 0.85, c = 5, m = 8) 
on an 8GB desktop is about 3 minutes for each simulated 
dataset. We found that when the number of "noisy" genes 
is small (1000 and 2000), both NCIS (a = 0.85) and 
consensus-clustering have 100% accuracy. When the num- 
ber of noisy genes is increased to 3000, NCIS (a = 0.85) 
starts to perform better than consensus clustering. As 
expected, once the number of noisy genes becomes exces- 
sive (5000 out of 9000), neither method can achieve high 
accuracy. Overall, our simulation result indicated that 
NCIS is a more robust method than consensus clustering 
to tolerate noise. 



eg 



d 




1 2 3 4 5 

# of noisy genes (x 1000) 



Figure 4 Accuracies on simulated datasets. NCIS (a = 0.85) vs. 
NCIS (a = 0) vs. consensus clustering on simulated datasets. Height 
of the solid boxes reflects the average accuracy in each setting 
(over 5 independent datasets simulated under the setting) and the 
bar indicates the standard deviation. P-value of paired one-sided 
t-test (25 data points for each group) for Hq: Accuracy (NCIS 
(a = 0.85)) < Accuracy (NCIS (a = 0)) is 0.0057. P-value of paired 
one-sided t-test (25 data points for each group) for Hq: Accuracy (NCIS 
(a = 0.85)) < Accuracy (Consensus clustering) is 0.001 9. 



We also observed that NCIS (a = 0.85) outperformed 
NCIS {a = 0) significantly in our simulated dataset 
(Figure 4). However, in the two real datasets, the advan- 
tage is marginal. We think the main reason could be 
that in our simulated datasets, the expression levels are 
strongly related to the network structure we collected 
(i.e. the interaction network well reflects how the gene 
expression is generated), while in real datasets there are 
more uncertainties and the network information we 
used is incomplete. 

Conclusions 

Cancer subtype information is of critical importance in 
designing better treatment strategies. We developed a 
novel clustering method, called NCIS, to identify cancer 
subtypes from high-throughput gene expression data. 
NCIS incorporates the network information within the 
clustering step to detect more informative sample sub- 
types. NCIS assigns a weight to each gene based on its 
connection in network and its distinguishing ability in 
expression level across all samples. Our approach avoids 
excluding a large number of genes, which results in 
much information loss for subsequent analysis in previous 
methods. In addition, we utilize a weighted co-clustering 
method to capture the duality of gene expression data, i.e. 
similarity is treated as a level of coherence of the samples 
and genes in the bi-cluster. 

The future directions of this problem should ideally 
address three key challenges. First, the network we used 
is assumed to be a generic molecular interaction net- 
work; it is not specific for the particular type of cancer 
or the tissue-type. Second, the network does not contain 
all the genes. Third, many edges in our current network 
do not have a high confidence level and the directions of 
many edges are unclear. These three problems can be 
addressed as we gain more complete understanding of 
the network. 

Further research is needed to design better approaches 
to choose the optimal parameters in NCIS, including a, 
c, and m. Since there is often no gold standard available 
for the clustering problem of a specific type of cancer, it 
is difficult to find the optimal parameters of a, c, and m. 
In our study, we use a = 0.85 to keep the balance be- 
tween network information and gene expression infor- 
mation. We did test the results using different values of 
(X, such as 0.8 and 0.7, and the results were comparable 
with minor differences in the clustering result. We be- 
lieve the problem of choosing the optimal a may require 
further studies when more data is collected through 
large-scale projects with detailed clinical features in the 
future. Such knowledge can be utilized to help select a. 
Additionally, how to determine the number of clusters 
(c and m) remains a difficult problem in clustering algo- 
rithms. In our work, we utilized cophenetic correlation 
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coefficients used in [35]. We compared the results using 
different m and c combinations where the cophenetic 
coefficients are slightly lower than the optimal combin- 
ation (Additional file 1: Table SI and S2). For both 
BRCA and GBM data, we observed that the subtypes 
identified were very similar (i.e., they had high correl- 
ation with the results from optimal combination) based 
on the p-values of a Chi-squared test between subtypes 
identified in the optimal combination. Therefore, the 
small variation in the choices of m and c results in very 
similar clustering here. However, further research is still 
needed to develop better approaches to automatically se- 
lect the most appropriate m and c. 

We believe our new NCIS algorithm will be useful to 
comprehensively identify cancer subtypes, which would 
otherwise be obscured by cancer heterogeneity, using 
high-throughput and high-dimensional gene expression 
data. Results from NCIS are expected to enhance our 
ability to discover important subtype patterns and key 
genes involved in each subtype, which will in turn help 
us better understand important network perturbations 
in a subtype-specific manner. 

Additional file 



Additional file 1: Supplementary materials. Available at BMC 
Bioinformatics online. 
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